Programmable Synthetic Tabular Data Generation
Large amounts of tabular data remain underutilized due to privacy, data
quality, and data sharing limitations. While training a generative model
producing synthetic data resembling the original distribution addresses some of
these issues, most applications require additional constraints from the
generated data. Existing synthetic data approaches are limited as they
typically only handle specific constraints, e.g., differential privacy (DP) or
increased fairness, and lack an accessible interface for declaring general
specifications. In this work, we introduce ProgSyn, the first programmable
synthetic tabular data generation algorithm that allows for comprehensive
customization over the generated data. To ensure high data quality while
adhering to custom specifications, ProgSyn pre-trains a generative model on the
original dataset and fine-tunes it on a differentiable loss automatically
derived from the provided specifications. These can be programmatically
declared using statistical and logical expressions, supporting a wide range of
requirements (e.g., DP or fairness, among others). We conduct an extensive
experimental evaluation of ProgSyn on a number of constraints, achieving a new
state-of-the-art on some, while remaining general. For instance, at the same
fairness level we achieve 2.3% higher downstream accuracy than the
state-of-the-art in fair synthetic data generation on the Adult dataset.
Overall, ProgSyn provides a versatile and accessible framework for generating
constrained synthetic tabular data, allowing for specifications that generalize
beyond the capabilities of prior work.
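The fine-tuning step can be pictured with a short sketch. This is a hypothetical illustration, not the ProgSyn interface: the generator, the fidelity_loss term, the demographic-parity penalty, and the weight lam are all assumed names standing in for whatever the declared specification induces.

# Hypothetical sketch of fine-tuning a pre-trained tabular generator on a
# differentiable penalty derived from a declared specification (here a
# demographic-parity-style fairness constraint). Not the ProgSyn API.
import torch

def dp_gap(samples, sens_col, label_col):
    # Soft demographic-parity gap between the two groups of a binary
    # sensitive feature, computed on relaxed (soft) samples.
    s = samples[:, sens_col]
    y = samples[:, label_col]
    rate_1 = (y * s).sum() / (s.sum() + 1e-8)
    rate_0 = (y * (1 - s)).sum() / ((1 - s).sum() + 1e-8)
    return (rate_1 - rate_0).abs()

def finetune(generator, fidelity_loss, optimizer, latent_dim,
             sens_col=0, label_col=-1, lam=10.0, steps=1000, batch=512):
    for _ in range(steps):
        z = torch.randn(batch, latent_dim)
        samples = generator(z)                     # differentiable samples
        loss = fidelity_loss(samples) + lam * dp_gap(samples, sens_col, label_col)
        optimizer.zero_grad()
        loss.backward()
        optimizer.step()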
FARE: Provably Fair Representation Learning with Practical Certificates
Fair representation learning (FRL) is a popular class of methods aiming to
produce fair classifiers via data preprocessing. Recent regulatory directives
stress the need for FRL methods that provide practical certificates, i.e.,
provable upper bounds on the unfairness of any downstream classifier trained on
preprocessed data, which directly provides assurance in a practical scenario.
Creating such FRL methods is an important challenge that remains unsolved. In
this work, we address that challenge and introduce FARE (Fairness with
Restricted Encoders), the first FRL method with practical fairness
certificates. FARE is based on our key insight that restricting the
representation space of the encoder enables the derivation of practical
guarantees, while still permitting favorable accuracy-fairness tradeoffs for
suitable instantiations, such as one we propose based on fair trees. To produce
a practical certificate, we develop and apply a statistical procedure that
computes a finite sample high-confidence upper bound on the unfairness of any
downstream classifier trained on FARE embeddings. In our comprehensive
experimental evaluation, we demonstrate that FARE produces practical
certificates that are tight and often even comparable with purely empirical
results obtained by prior methods, which establishes the practical value of our
approach.
Comment: ICML 202
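The certification idea can be illustrated with a small sketch: once the encoder maps every input to one of finitely many cells, the demographic-parity gap of any downstream classifier is bounded by the total-variation distance between the two groups' cell distributions, which admits a finite-sample high-confidence upper bound. The snippet below is only an illustrative stand-in using a crude per-cell Hoeffding bound with a union bound; it is not FARE's actual statistical procedure.

# Illustrative sketch: upper-bound the demographic-parity gap of ANY classifier
# trained on a k-cell embedding, holding with probability >= 1 - delta.
# Assumes cell indices lie in {0, ..., k-1} and a binary sensitive attribute.
import numpy as np

def unfairness_upper_bound(cells, sensitive, k, delta=0.05):
    cells, sensitive = np.asarray(cells), np.asarray(sensitive)
    n0, n1 = (sensitive == 0).sum(), (sensitive == 1).sum()
    p0 = np.bincount(cells[sensitive == 0], minlength=k) / n0
    p1 = np.bincount(cells[sensitive == 1], minlength=k) / n1
    emp_tv = 0.5 * np.abs(p0 - p1).sum()               # empirical total variation
    eps0 = np.sqrt(np.log(4 * k / delta) / (2 * n0))   # per-cell deviation, group 0
    eps1 = np.sqrt(np.log(4 * k / delta) / (2 * n1))   # per-cell deviation, group 1
    return min(1.0, emp_tv + 0.5 * k * (eps0 + eps1))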
Certified Defenses: Why Tighter Relaxations May Hurt Training
Certified defenses based on convex relaxations are an established technique
for training provably robust models. The key component is the choice of
relaxation, varying from simple intervals to tight polyhedra. Paradoxically,
however, training with tighter relaxations can often lead to worse certified
robustness. The poor understanding of this paradox has forced recent
state-of-the-art certified defenses to focus on designing various heuristics in
order to mitigate its effects. In contrast, in this paper we study the
underlying causes and show that tightness alone may not be the determining
factor. Concretely, we identify two key properties of relaxations that impact
training dynamics: continuity and sensitivity. Our extensive experimental
evaluation demonstrates that these two factors, observed alongside tightness,
explain the drop in certified robustness for popular relaxations. Further, we
investigate the possibility of designing and training with relaxations that are
tight, continuous and not sensitive. We believe the insights of this work can
help drive the principled discovery of new and effective certified defense
mechanisms.
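As a concrete reference point, the loosest relaxation in this spectrum, interval bound propagation, can be written in a few lines; tighter polyhedral relaxations replace these boxes with more precise sets. The snippet is a generic illustration, not code from the paper.

# Interval bound propagation (IBP) through an affine layer and a ReLU:
# the simplest convex relaxation used in certified training.
import numpy as np

def affine_bounds(lo, hi, W, b):
    center, radius = (lo + hi) / 2, (hi - lo) / 2
    c = W @ center + b
    r = np.abs(W) @ radius        # worst-case growth of the box
    return c - r, c + r

def relu_bounds(lo, hi):
    return np.maximum(lo, 0.0), np.maximum(hi, 0.0)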
Data Leakage in Tabular Federated Learning
While federated learning (FL) promises to preserve privacy in distributed
training of deep learning models, recent work in the image and NLP domains
showed that training updates leak private data of participating clients. At the
same time, most high-stakes applications of FL (e.g., legal and financial) use
tabular data. Compared to the NLP and image domains, reconstruction of tabular
data poses several unique challenges: (i) categorical features introduce a
significantly more difficult mixed discrete-continuous optimization problem,
(ii) the mix of categorical and continuous features causes high variance in the
final reconstructions, and (iii) structured data makes it difficult for the
adversary to judge reconstruction quality. In this work, we tackle these
challenges and propose the first comprehensive reconstruction attack on tabular
data, called TabLeak. TabLeak is based on three key ingredients: (i) a softmax
structural prior, implicitly converting the mixed discrete-continuous
optimization problem into an easier fully continuous one, (ii) a way to reduce
the variance of our reconstructions through a pooled ensembling scheme
exploiting the structure of tabular data, and (iii) an entropy measure which
can successfully assess reconstruction quality. Our experimental evaluation
demonstrates the effectiveness of TabLeak, reaching a state-of-the-art on four
popular tabular datasets. For instance, on the Adult dataset, we improve attack
accuracy by 10% compared to the baseline on the practically relevant batch size
of 32 and further obtain non-trivial reconstructions for batch sizes as large
as 128. Our findings are important as they show that performing FL on tabular
data, which often poses high privacy risks, is highly vulnerable.
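The first ingredient, the softmax structural prior, can be sketched as follows; the function name and the way features are assembled are hypothetical, not the TabLeak code.

# Hypothetical sketch of the softmax structural prior: each categorical feature
# is parameterized by free logits whose softmax yields a soft one-hot block, so
# the whole candidate row is continuous and can be optimized by gradient
# descent against the observed client update.
import torch

def assemble_row(cont_params, cat_logits_list):
    parts = [cont_params]                              # continuous features
    for logits in cat_logits_list:
        parts.append(torch.softmax(logits, dim=-1))    # relaxed one-hot block
    return torch.cat(parts, dim=-1)

# After optimization, each soft block is discretized with argmax to obtain the
# final categorical reconstruction.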
Data Leakage in Federated Averaging
Recent attacks have shown that user data can be recovered from FedSGD
updates, thus breaking privacy. However, these attacks are of limited practical
relevance as federated learning typically uses the FedAvg algorithm. Compared
to FedSGD, recovering data from FedAvg updates is much harder as: (i) the
updates are computed at unobserved intermediate network weights, (ii) a large
number of batches are used, and (iii) labels and network weights vary
simultaneously across client steps. In this work, we propose a new
optimization-based attack which successfully attacks FedAvg by addressing the
above challenges. First, we solve the optimization problem using automatic
differentiation: we simulate the client's update for the recovered labels and
inputs, generating the unobserved intermediate parameters, and force the
simulated update to match the received client update. Second, we address the large number of batches by
relating images from different epochs with a permutation invariant prior.
Third, we recover the labels by estimating the parameters of existing FedSGD
attacks at every FedAvg step. On the popular FEMNIST dataset, we demonstrate
that on average we successfully recover >45% of the client's images from
realistic FedAvg updates computed on 10 local epochs of 10 batches each with 5
images, compared to only <10% using the baseline. Our findings show many
real-world federated learning implementations based on FedAvg are vulnerable.
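The simulation idea can be illustrated with a short sketch, not the paper's code: candidate inputs and labels are free variables, the client's local SGD steps are replayed on them with automatic differentiation, and the distance between the simulated and the observed FedAvg update is minimized. The use of torch.func.functional_call for the replay and the single batch per local step are simplifying assumptions.

# Illustrative sketch of attacking FedAvg by simulating the client update.
import torch
from torch.func import functional_call

def simulate_client(model, start_params, x_cand, y_cand, lr, local_steps):
    params = {k: v for k, v in start_params.items()}
    for _ in range(local_steps):
        out = functional_call(model, params, (x_cand,))
        loss = torch.nn.functional.cross_entropy(out, y_cand)
        grads = torch.autograd.grad(loss, list(params.values()), create_graph=True)
        params = {k: v - lr * g for (k, v), g in zip(params.items(), grads)}
    return params   # unobserved intermediate weights end here

def reconstruction_loss(model, start_params, observed_update, x_cand, y_cand, lr, steps):
    end_params = simulate_client(model, start_params, x_cand, y_cand, lr, steps)
    return sum(((end_params[k] - start_params[k]) - observed_update[k]).pow(2).sum()
               for k in start_params)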
Learning Certified Individually Fair Representations
Fair representation learning provides an effective way of enforcing fairness
constraints without compromising utility for downstream users. A desirable
family of such fairness constraints, each requiring similar treatment for
similar individuals, is known as individual fairness. In this work, we
introduce the first method that enables data consumers to obtain certificates
of individual fairness for existing and new data points. The key idea is to map
similar individuals to close latent representations and leverage this latent
proximity to certify individual fairness. That is, our method enables the data
producer to learn and certify a representation where for a data point all
similar individuals are at ℓ∞-distance at most ε, thus
allowing data consumers to certify individual fairness by proving
ε-robustness of their classifier. Our experimental evaluation on five
real-world datasets and several fairness constraints demonstrates the
expressivity and scalability of our approach.
Comment: Conference Paper at NeurIPS 202
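The composition behind the certificate amounts to a one-line check. The function below is only a schematic restatement of that reasoning, with hypothetical argument names.

# Schematic restatement of the certificate: if the producer proves that every
# individual similar to x is encoded within l_inf distance eps of z = f(x), and
# the consumer proves their classifier at z is robust to any l_inf perturbation
# of size eps, then all similar individuals receive the same prediction.
def individually_fair_at(encoder_similarity_radius, classifier_robust_radius, eps):
    return encoder_similarity_radius <= eps and eps <= classifier_robust_radius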
Efficient Certification of Spatial Robustness
Recent work has exposed the vulnerability of computer vision models to vector
field attacks. Due to the widespread usage of such models in safety-critical
applications, it is crucial to quantify their robustness against such spatial
transformations. However, existing work only provides empirical robustness
quantification against vector field deformations via adversarial attacks, which
lack provable guarantees. In this work, we propose novel convex relaxations,
enabling us, for the first time, to provide a certificate of robustness against
vector field transformations. Our relaxations are model-agnostic and can be
leveraged by a wide range of neural network verifiers. Experiments on various
network architectures and different datasets demonstrate the effectiveness and
scalability of our method.
Comment: Conference Paper at AAAI 202
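A coarse version of the bounding step can be sketched directly: if each pixel may be displaced by at most delta in every coordinate, a bilinearly interpolated pixel value stays within the minimum and maximum of the image over the reachable neighborhood. The paper's relaxations are considerably tighter; this sketch only conveys the soundness argument.

# Coarse interval bound on one interpolated pixel under a vector-field
# displacement of magnitude at most delta per coordinate. Any bilinear
# interpolation at a displaced location is a convex combination of pixels in
# the covered grid window, so its value lies in [patch.min(), patch.max()].
import numpy as np

def pixel_interval(img, i, j, delta):
    h, w = img.shape
    i0, i1 = max(0, int(np.floor(i - delta))), min(h - 1, int(np.ceil(i + delta)))
    j0, j1 = max(0, int(np.floor(j - delta))), min(w - 1, int(np.ceil(j + delta)))
    patch = img[i0:i1 + 1, j0:j1 + 1]
    return float(patch.min()), float(patch.max())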
Robustness Certification for Point Cloud Models
The use of deep 3D point cloud models in safety-critical applications, such
as autonomous driving, dictates the need to certify the robustness of these
models to real-world transformations. This is technically challenging, as it
requires a scalable verifier tailored to point cloud models that handles a wide
range of semantic 3D transformations. In this work, we address this challenge
and introduce 3DCertify, the first verifier able to certify the robustness of
point cloud models. 3DCertify is based on two key insights: (i) a generic
relaxation based on first-order Taylor approximations, applicable to any
differentiable transformation, and (ii) a precise relaxation for global feature
pooling, which is more complex than pointwise activations (e.g., ReLU or
sigmoid) but commonly employed in point cloud models. We demonstrate the
effectiveness of 3DCertify by performing an extensive evaluation on a wide
range of 3D transformations (e.g., rotation, twisting) for both classification
and part segmentation tasks. For example, we can certify robustness against
rotations by 60° for 95.7% of point clouds, and our max pool
relaxation increases certification by up to 15.6%.
Comment: International Conference on Computer Vision (ICCV) 202
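The first insight can be made concrete on a single coordinate: for a rotation about the z-axis by an angle in a given interval, the rotated x-coordinate is sandwiched between its first-order Taylor expansion at the interval midpoint plus and minus a Lagrange remainder. The snippet is an illustrative special case, not the verifier itself.

# First-order Taylor relaxation of f(t) = x*cos(t) - y*sin(t) for t in
# [t_lo, t_hi]: the linear part is evaluated at the interval endpoints and the
# remainder uses |f''(t)| <= sqrt(x^2 + y^2).
import numpy as np

def rotated_x_bounds(x, y, t_lo, t_hi):
    t0 = (t_lo + t_hi) / 2
    f0 = x * np.cos(t0) - y * np.sin(t0)                 # value at midpoint
    df = -x * np.sin(t0) - y * np.cos(t0)                # first derivative
    half_width = (t_hi - t_lo) / 2
    remainder = 0.5 * np.hypot(x, y) * half_width ** 2   # Lagrange remainder bound
    lin = lambda t: f0 + df * (t - t0)
    lo = min(lin(t_lo), lin(t_hi)) - remainder
    hi = max(lin(t_lo), lin(t_hi)) + remainder
    return lo, hi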
Latent Space Smoothing for Individually Fair Representations
Fair representation learning encodes user data to ensure fairness and
utility, regardless of the downstream application. However, learning
individually fair representations, i.e., guaranteeing that similar individuals
are treated similarly, remains challenging in high-dimensional settings such as
computer vision. In this work, we introduce LASSI, the first representation
learning method for certifying individual fairness of high-dimensional data.
Our key insight is to leverage recent advances in generative modeling to
capture the set of similar individuals in the generative latent space. This
allows learning individually fair representations where similar individuals are
mapped close together, by using adversarial training to minimize the distance
between their representations. Finally, we employ randomized smoothing to
provably map similar individuals close together, in turn ensuring that local
robustness verification of the downstream application results in end-to-end
fairness certification. Our experimental evaluation on challenging real-world
image data demonstrates that our method increases certified individual fairness
by up to 60%, without significantly affecting task utility.
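One way to picture the training signal is the following sketch: a similar individual is generated by shifting a sample's latent code along a generative attribute direction, and the encoder is penalized for mapping the two far apart. The names (encoder, decoder, attr_direction) are placeholders, and the random shift stands in for the paper's adversarial search and for the subsequent randomized smoothing step.

# Hypothetical sketch of the similarity loss: decode a sample and a latent
# neighbor obtained by shifting along a generative attribute direction, then
# penalize the distance between their representations.
import torch

def similarity_loss(encoder, decoder, z, attr_direction, max_shift=1.0):
    shift = (torch.rand(z.size(0), 1) * 2 - 1) * max_shift   # random similar individual
    x = decoder(z)
    x_similar = decoder(z + shift * attr_direction)
    return (encoder(x) - encoder(x_similar)).pow(2).sum(dim=-1).mean()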